
Study Notes on Microsoft's GPT-2 Model Source Code


Related links: the GPT-2 paper, and the Microsoft DeepSpeed GPT-2 source code.

The GPT-2 code integrated into Microsoft DeepSpeed feels much more readable than the huggingface implementation. I use it here purely to study the code structure, and for now I ignore the model-sharding / parallelism parts.

(Even though that feels like skipping the best part... Orz)

Contents

1. GPT-2 Model Overview
2. Reading the GPT-2 Code Modules
   2.1 GPT2Model Main Module
   2.2 GPT2Transformer Module
   2.3 GPT2TransformerLayer Module
   2.4 GPT2SelfAttention Module
   2.5 GPT2MLP Module
3. GPT-2 Model Pretraining
   3.1 GPT2 Pretraining - Building the Model
   3.2 GPT2 Pretraining - forward
References

1. GPT-2 Model Overview

GPT-2 is a pretrained language model released in 2019, trained on more than 40 GB of text scraped from roughly 8 million web pages.

GPT-2 can be understood as a stack of transformer decoder blocks, whose input is word embeddings + position embeddings. Each transformer block processes a token as follows: the token first goes through the self-attention layer and is then passed to the feed-forward network. Once the first transformer block has finished processing the token, the resulting vector is passed to the next block in the stack and the computation continues. Every block performs the same computation, but each one maintains its own weights for the self-attention layer and the feed-forward network. A minimal sketch of this flow is shown right after this paragraph.
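To make the data flow concrete, here is a tiny self-contained sketch of a decoder-only model in plain PyTorch. All names and sizes (TinyDecoderLM, hidden=64, and so on) are my own toy choices rather than anything from the DeepSpeed code, and torch.nn.TransformerEncoderLayer with a causal mask is used only as a stand-in for a decoder block.

import torch

class TinyDecoderLM(torch.nn.Module):
    """Toy decoder-only LM: token + position embeddings -> N identical blocks."""

    def __init__(self, vocab_size=100, max_len=32, hidden=64, heads=4, num_layers=2):
        super().__init__()
        self.wte = torch.nn.Embedding(vocab_size, hidden)   # word embeddings
        self.wpe = torch.nn.Embedding(max_len, hidden)      # position embeddings
        block = torch.nn.TransformerEncoderLayer(hidden, heads,
                                                 dim_feedforward=4 * hidden,
                                                 batch_first=True, norm_first=True)
        # Every block runs the same computation but keeps its own weights.
        self.blocks = torch.nn.TransformerEncoder(block, num_layers)

    def forward(self, input_ids):                            # [b, s]
        s = input_ids.size(1)
        positions = torch.arange(s, device=input_ids.device)
        x = self.wte(input_ids) + self.wpe(positions)        # [b, s, h]
        # Causal mask: a position may only attend to itself and earlier positions.
        causal = torch.triu(torch.full((s, s), float('-inf')), diagonal=1)
        return self.blocks(x, mask=causal)                   # [b, s, h] hidden states

hidden_states = TinyDecoderLM()(torch.randint(0, 100, (2, 10)))
print(hidden_states.shape)   # torch.Size([2, 10, 64])

The projection from the final hidden states back to vocabulary logits (which reuses the word-embedding matrix) appears in the GPT2Model code in Section 2.1.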

2. Reading the GPT-2 Code Modules

The GPT-2 code modules are quite readable; the overall structure is as follows:

[Figure: overall module structure of the GPT-2 code]

2.1 GPT2Model Main Module

The GPT2Model main module, with my comments added:

class GPT2Model(torch.nn.Module):
    """GPT-2 Language model.

    The output of the forward method is the logits (parallel or serial
    depending on the `parallel_output` flag).
    """

    def __init__(self,
                 num_layers,
                 vocab_size,
                 hidden_size,
                 num_attention_heads,
                 embedding_dropout_prob,
                 attention_dropout_prob,
                 output_dropout_prob,
                 max_sequence_length,
                 checkpoint_activations,
                 checkpoint_num_layers=1,
                 parallel_output=True):
        super(GPT2Model, self).__init__()

        self.parallel_output = parallel_output

        init_method = init_method_normal(std=0.02)

        # Word embeddings (parallel).
        # Word-embedding table of shape vocab_size * hidden_size, used for lookup embeddings.
        self.word_embeddings = mpu.VocabParallelEmbedding(
            vocab_size, hidden_size, init_method=init_method)

        # Position embedding (serial).
        # Position-embedding table of shape max_sequence_length * hidden_size, used to look up
        # an embedding for every position; this is an absolute positional encoding.
        self.position_embeddings = torch.nn.Embedding(max_sequence_length,
                                                      hidden_size)
        # Initialize the position embeddings.
        init_method(self.position_embeddings.weight)

        # Embeddings dropout
        self.embedding_dropout = torch.nn.Dropout(embedding_dropout_prob)

        # Transformer
        # Build the transformer module (discussed in detail below).
        self.transformer = mpu.GPT2ParallelTransformer(num_layers,          # number of transformer layers
                                                       hidden_size,
                                                       num_attention_heads, # number of attention heads
                                                       attention_dropout_prob,
                                                       output_dropout_prob,
                                                       checkpoint_activations,
                                                       checkpoint_num_layers)

    def forward(self, input_ids, position_ids, attention_mask):
        # Embeddings.
        # Look up token embeddings from the input ids.
        words_embeddings = self.word_embeddings(input_ids)
        # Look up position embeddings from the position ids.
        position_embeddings = self.position_embeddings(position_ids)
        # The actual input is the sum of the token and position embeddings.
        embeddings = words_embeddings + position_embeddings

        # Dropout.
        embeddings = self.embedding_dropout(embeddings)

        # Transformer.
        # Feed the embeddings and the mask into the transformer.
        transformer_output = self.transformer(embeddings, attention_mask)

        # Parallel logits.
        # Logits computed in model-parallel fashion.
        transformer_output_parallel = mpu.copy_to_model_parallel_region(
            transformer_output)
        logits_parallel = F.linear(transformer_output_parallel,
                                   self.word_embeddings.weight)

        if self.parallel_output:
            return logits_parallel

        return mpu.gather_from_model_parallel_region(logits_parallel)
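One detail worth pausing on: the output projection has no weight matrix of its own. F.linear(transformer_output, self.word_embeddings.weight) reuses the embedding table, so the logits dimension equals the vocabulary size. A toy shape check (all sizes are made up for illustration):

import torch
import torch.nn.functional as F

batch, seq_len, hidden, vocab = 2, 8, 16, 100        # toy sizes
wte = torch.nn.Embedding(vocab, hidden)              # embedding table [vocab, hidden]
transformer_output = torch.randn(batch, seq_len, hidden)

# F.linear(x, W) computes x @ W.T, so [b, s, h] x [vocab, h].T -> [b, s, vocab].
logits = F.linear(transformer_output, wte.weight)
print(logits.shape)                                  # torch.Size([2, 8, 100])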

2.2 GPT2Transformer Module

The GPT2ParallelTransformer module lives in mpu/transformer.py. mpu is the model-parallel framework, which wraps the parallel-training code for both BERT and GPT-2. Here I only look at the parts relevant to the underlying principles and ignore the parallelism for now.

This module is the backbone of the model: it packs n transformer blocks together, i.e. it consists of two parts, n * transformer layer plus a final layer norm.

The code for a single transformer layer is covered in Section 2.3. A sketch of the activation-checkpointing chunking used in this module's forward comes right after this paragraph.
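The idea behind checkpoint_activations is that the layers are run in chunks of checkpoint_num_layers, and each chunk's activations are recomputed during backward instead of being stored. The sketch below uses torch.utils.checkpoint purely as an illustration; the original code swaps in deepspeed.checkpointing.checkpoint when DeepSpeed checkpointing is configured.

import torch
from torch.utils.checkpoint import checkpoint

def chunked_forward(layers, hidden_states, attention_mask, chunk_length=1):
    """Run `layers` in chunks, recomputing each chunk's activations in backward."""

    def custom(start, end):
        # Package layers[start:end] into a single function so one checkpoint
        # call covers the whole chunk.
        def custom_forward(x, mask):
            for layer in layers[start:end]:
                x = layer(x, mask)
            return x
        return custom_forward

    l, num_layers = 0, len(layers)
    while l < num_layers:
        hidden_states = checkpoint(custom(l, l + chunk_length),
                                   hidden_states, attention_mask)
        l += chunk_length
    return hidden_states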

class GPT2ParallelTransformer(torch.nn.Module):
    """GPT-2 transformer.

    This module takes input from embedding layer and its output can
    be used directly by a logit layer. It consists of L (num-layers)
    blocks of:
        layer norm
        self attention
        residual connection
        layer norm
        mlp
        residual connection
    followed by a final layer norm.

    Arguments:
        num_layers: Number of transformer layers.
        hidden_size: The hidden size of the self attention.
        num_attention_heads: number of attention heads in the self attention.
        attention_dropout_prob: dropout probability of the attention score
                                in self attention.
        output_dropout_prob: dropout probability for the outputs after self
                             attention and final output.
        checkpoint_activations: if True, checkpoint activations.
        checkpoint_num_layers: number of layers to checkpoint. This is
                               basically the chunk size in checkpointing.
        layernorm_epsilon: epsilon used in layernorm to avoid division by zero.
        init_method_std: standard deviation of the init method which has
                         the form N(0, std).
        use_scaled_init_for_output_weights: If True, use 1/sqrt(2*num_layers)
                                            scaling for the output weights
                                            (output of self attention and mlp).
    """

    def __init__(self,
                 num_layers,
                 hidden_size,
                 num_attention_heads,
                 attention_dropout_prob,
                 output_dropout_prob,
                 checkpoint_activations,
                 checkpoint_num_layers=1,
                 layernorm_epsilon=1.0e-5,
                 init_method_std=0.02,
                 use_scaled_init_for_output_weights=True,
                 sparse_attention_config=None,
                 max_seq_length=None):
        super(GPT2ParallelTransformer, self).__init__()
        # Store activation checkpointing flag.
        self.checkpoint_activations = checkpoint_activations
        self.checkpoint_num_layers = checkpoint_num_layers

        output_layer_init_method = None
        if use_scaled_init_for_output_weights:
            output_layer_init_method = scaled_init_method(init_method_std,
                                                          num_layers)

        # Returns a single transformer layer (discussed below).
        def get_layer():
            return GPT2ParallelTransformerLayer(
                hidden_size,
                num_attention_heads,
                attention_dropout_prob,
                output_dropout_prob,
                layernorm_epsilon,
                unscaled_init_method(init_method_std),
                output_layer_init_method=output_layer_init_method,
                sparse_attention_config=sparse_attention_config,
                max_seq_length=max_seq_length)

        # Transformer layers.
        # Build num_layers transformer layers.
        self.layers = torch.nn.ModuleList(
            [get_layer() for _ in range(num_layers)])

        # Final layer norm before output.
        self.final_layernorm = LayerNorm(hidden_size, eps=layernorm_epsilon)

        if deepspeed.checkpointing.is_configured():
            global get_cuda_rng_tracker, checkpoint
            get_cuda_rng_tracker = deepspeed.checkpointing.get_cuda_rng_tracker
            checkpoint = deepspeed.checkpointing.checkpoint

    def forward(self, hidden_states, attention_mask):

        def custom(start, end):
            # custom() packages layers[start:end] into a single function so
            # that activation checkpointing can treat the whole chunk as one
            # recomputable unit.
            def custom_forward(*inputs):
                layers_ = self.layers[start:end]
                x_ = inputs[0]
                for layer in layers_:
                    x_ = layer(x_, inputs[1])
                return x_
            return custom_forward

        if self.checkpoint_activations:
            # Run the layers chunk by chunk, checkpointing each chunk.
            l = 0
            num_layers = len(self.layers)
            chunk_length = self.checkpoint_num_layers
            while l < num_layers:
                hidden_states = checkpoint(custom(l, l + chunk_length),
                                           hidden_states, attention_mask)
                l += chunk_length
        else:
            # Otherwise just run the layers one after another.
            for layer in self.layers:
                hidden_states = layer(hidden_states, attention_mask)

        # Final layer norm.
        output = self.final_layernorm(hidden_states)

        return output

2.3 GPT2TransformerLayer Module

Each GPT2ParallelTransformerLayer is one pre-layer-norm transformer block, with exactly the per-block structure listed in the docstring above: layer norm, self attention, residual connection, layer norm, mlp, residual connection. The second half of its forward applies the MLP (h -> 4*h -> h, see Section 2.5) to the normalized hidden states and adds the second residual connection (a stand-alone sketch of the full block follows below):

    # MLP: h -> 4*h -> h.
    mlp_output = self.mlp(layernorm_output)
    # Second residual connection.
    output = layernorm_input + mlp_output
    return output
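To see the whole block in one place, here is a minimal stand-in written from the structure listed above. The submodule names (input_layernorm, attention, post_attention_layernorm, mlp) and the plain-PyTorch attention/MLP are my own stand-ins, not the mpu implementation.

import torch

class PreLNTransformerBlock(torch.nn.Module):
    """Minimal stand-in for GPT2ParallelTransformerLayer (no model parallelism)."""

    def __init__(self, hidden_size, num_heads, eps=1e-5):
        super().__init__()
        self.input_layernorm = torch.nn.LayerNorm(hidden_size, eps=eps)
        self.attention = torch.nn.MultiheadAttention(hidden_size, num_heads,
                                                     batch_first=True)
        self.post_attention_layernorm = torch.nn.LayerNorm(hidden_size, eps=eps)
        self.mlp = torch.nn.Sequential(
            torch.nn.Linear(hidden_size, 4 * hidden_size),
            torch.nn.GELU(),
            torch.nn.Linear(4 * hidden_size, hidden_size))

    def forward(self, hidden_states, causal_mask):
        # Layer norm at the beginning of the block.
        layernorm_output = self.input_layernorm(hidden_states)
        # Self-attention on the normalized input (causal_mask: [s, s], True = blocked).
        attention_output, _ = self.attention(layernorm_output, layernorm_output,
                                             layernorm_output, attn_mask=causal_mask)
        # First residual connection, added to the un-normalized input.
        layernorm_input = hidden_states + attention_output
        # Layer norm after the self-attention.
        layernorm_output = self.post_attention_layernorm(layernorm_input)
        # MLP: h -> 4*h -> h, then the second residual connection.
        mlp_output = self.mlp(layernorm_output)
        return layernorm_input + mlp_output

block = PreLNTransformerBlock(hidden_size=64, num_heads=4)
x = torch.randn(2, 10, 64)
mask = torch.triu(torch.ones(10, 10, dtype=torch.bool), diagonal=1)  # True = blocked
print(block(x, mask).shape)   # torch.Size([2, 10, 64])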

2.4 GPT2SelfAttention Module

This is GPT-2's self-attention module. I again ignore the model-parallel parts and match the code against the underlying principle only. (The bits that involve parallel sharding still deserve a quick look, though.)

This is the core of the model: essentially the multi-head self-attention computation plus the attention mask. The comments in this part are fairly detailed, and reading them alongside the original author's docstrings makes the code easier to follow. A small sketch of how the causal (left-to-right) mask works comes right after this paragraph.
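Before the code, here is a minimal, self-contained sketch of the left-to-right mask trick used in the forward below: ltor_mask is a lower-triangular matrix of ones, the scores are multiplied by it elementwise, and the masked-out positions get a large negative constant so they vanish after the softmax (toy sizes, plain PyTorch rather than the mpu code):

import torch

s = 4                                   # toy sequence length
scores = torch.randn(1, 1, s, s)        # raw attention scores [b, np, s, s]

# Lower-triangular causal mask: 1 where attending is allowed, 0 above the diagonal.
ltor_mask = torch.tril(torch.ones(1, 1, s, s))

# Keep the allowed scores, push the rest to a large negative value ...
masked_scores = scores * ltor_mask - 10000.0 * (1.0 - ltor_mask)
# ... so that softmax assigns them (almost) zero probability.
probs = torch.softmax(masked_scores, dim=-1)
print(probs[0, 0])   # row i only has non-zero weights on positions <= i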

class GPT2ParallelSelfAttention(torch.nn.Module):
    """Parallel self-attention layer for GPT2.

    Self-attention layer takes input with size [b, s, h] where b is
    the batch size, s is the sequence length, and h is the hidden size
    and creates output of the same size.

    Arguments:
        hidden_size: total hidden size of the layer (h).
        num_attention_heads: number of attention heads (n). Note that we
                             require n to be divisible by number of GPUs
                             used to parallelize the model. Also, we
                             require hidden size to be divisible by n.
        dropout_prob: dropout probability for the attention scores.
        init_method: weight initialization.
        output_layer_init_method: output layer initialization. If None, use
                                  `init_method`.

    We use the following notation:
        h:  hidden_size
        n:  num_attention_heads
        p:  number of partitions (p = 1 when the model is not sharded)
        np: n/p = n  (when p = 1)
        hp: h/p = h  (when p = 1)
        hn: h/n, the hidden size of each attention head
        b:  batch size
        s:  sequence length
    """

    def __init__(self, hidden_size, num_attention_heads,
                 attention_dropout_prob, output_dropout_prob,
                 init_method, output_layer_init_method=None):
        super(GPT2ParallelSelfAttention, self).__init__()
        # Set output layer initialization if not provided.
        if output_layer_init_method is None:
            output_layer_init_method = init_method
        # Per attention head and per partition values.
        world_size = get_model_parallel_world_size()
        # Model sharding: when the model is not sharded, world_size=1 and nothing changes.
        self.hidden_size_per_partition = divide(hidden_size, world_size)
        # Hidden size of each attention head.
        # E.g. hidden_size=256 with 8 attention heads gives 256/8 = 32 per head.
        # (This is just the definition of multi-head attention; it has nothing
        # to do with model sharding.)
        self.hidden_size_per_attention_head = divide(hidden_size,
                                                     num_attention_heads)
        # Number of attention heads per model shard. With world_size=1 this is
        # simply the num_attention_heads passed in. If world_size=2, i.e. the
        # model is split across 2 GPUs, each shard handles
        # (num_attention_heads / 2) heads.
        self.num_attention_heads_per_partition = divide(num_attention_heads,
                                                        world_size)
        # Strided linear layer.
        # ColumnParallelLinear is a linear layer that supports model sharding;
        # essentially it is just y = x * W + b.
        # It maps the input from hidden_size to 3*hidden_size.
        # When the model is not sharded, the stride and gather_output arguments
        # have no effect. (Knowing what the op does is enough; I won't dig into
        # it here.)
        self.query_key_value = ColumnParallelLinear(hidden_size, 3*hidden_size,
                                                    stride=3,
                                                    gather_output=False,
                                                    init_method=init_method)
        # Dropout. Note that for a single iteration, this layer will generate
        # different outputs on different number of parallel partitions but
        # on average it should not be partition dependent.
        # The author's note: dropout on the attention scores differs across
        # model shards within one iteration, but on average it should not
        # depend on how the model is partitioned.
        self.attention_dropout = torch.nn.Dropout(attention_dropout_prob)

        # Output.
        # Another linear transform; its weight has shape [h, h].
        self.dense = RowParallelLinear(hidden_size,
                                       hidden_size,
                                       input_is_parallel=True,
                                       init_method=output_layer_init_method)
        self.output_dropout = torch.nn.Dropout(output_dropout_prob)

        if deepspeed.checkpointing.is_configured():
            global get_cuda_rng_tracker, checkpoint
            get_cuda_rng_tracker = deepspeed.checkpointing.get_cuda_rng_tracker
            checkpoint = deepspeed.checkpointing.checkpoint

    def _transpose_for_scores(self, tensor):
        """Transpose a 3D tensor [b, s, np*hn] into a 4D tensor with
        size [b, np, s, hn].

        Without sharding, np = n and hn = h/n, so the 3D tensor is really
        [b, s, h]; it is split by attention head into [b, n, s, hn].
        """
        # Target shape: (b, s) + (np, hn) = (b, s, np, hn).
        new_tensor_shape = tensor.size()[:-1] + \
                           (self.num_attention_heads_per_partition,
                            self.hidden_size_per_attention_head)
        # Reshape into the target shape.
        tensor = tensor.view(*new_tensor_shape)
        return tensor.permute(0, 2, 1, 3)

    def forward(self, hidden_states, ltor_mask):
        # hidden_states: [b, s, h]
        # ltor_mask: [1, 1, s, s]

        # Attention heads. [b, s, hp]
        # With p=1 this is just [b, s, h].
        # query_key_value: [b, s, h] -> [b, s, 3*h]
        # Because this is self-attention, the query/key/value projections are
        # all applied to hidden_states: [b, s, h] * [h, 3h] -> [b, s, 3*h].
        # The q, k, v projections are fused into one matmul; the last dimension
        # is split afterwards.
        mixed_x_layer = self.query_key_value(hidden_states)
        # split_tensor_along_last_dim splits the last dimension into n equal
        # parts, here 3, which separates q, k and v from the fused projection.
        # q, k and v each have shape [b, s, h].
        (mixed_query_layer,
         mixed_key_layer,
         mixed_value_layer) = split_tensor_along_last_dim(mixed_x_layer, 3)

        # Reshape and transpose [b, np, s, hn]
        # Split q, k, v by attention head; np * hn = h (when p=1).
        query_layer = self._transpose_for_scores(mixed_query_layer)
        key_layer = self._transpose_for_scores(mixed_key_layer)
        value_layer = self._transpose_for_scores(mixed_value_layer)

        # Raw attention scores. [b, np, s, s]
        # q * k^T gives the attention scores.
        attention_scores = torch.matmul(query_layer,
                                        key_layer.transpose(-1, -2))
        # q * k^T / sqrt(hn)
        attention_scores = attention_scores / math.sqrt(
            self.hidden_size_per_attention_head)

        # Apply the left to right attention mask.
        # attention_scores has shape [b, np, s, s] and ltor_mask has shape
        # [1, 1, s, s]; the mask is all 1 in the lower triangle and all 0 above
        # the diagonal. The two are multiplied elementwise (Hadamard product),
        # which keeps only the scores for positions at or before the current
        # token; later positions get -10000, i.e. a very small value.
        attention_scores = torch.mul(attention_scores, ltor_mask) - \
                           10000.0 * (1.0 - ltor_mask)

        # Attention probabilities. [b, np, s, s]
        # Softmax over the last dimension gives the attention probabilities
        # for every position.
        attention_probs = torch.nn.Softmax(dim=-1)(attention_scores)
        # This is actually dropping out entire tokens to attend to, which might
        # seem a bit unusual, but is taken from the original Transformer paper.
        # The author notes that this dropout drops whole attention weights,
        # which looks a bit odd. (Indeed... I'll leave this as an open question.)
        with get_cuda_rng_tracker().fork():
            attention_probs = self.attention_dropout(attention_probs)

        # Context layer.
        # [b, np, s, hn]
        # Weighted sum: [b, np, s, s] * [b, np, s, hn] -> [b, np, s, hn].
        context_layer = torch.matmul(attention_probs, value_layer)
        # [b, s, np, hn]
        # Merge the heads back: first permute the dimensions back ...
        context_layer = context_layer.permute(0, 2, 1, 3).contiguous()
        # ... then compute the merged shape (b, s) + (h,) = (b, s, h),
        # still assuming the model is not sharded.
        new_context_layer_shape = context_layer.size()[:-2] + \
                                  (self.hidden_size_per_partition,)
        # [b, s, hp] reshape into the merged shape.
        context_layer = context_layer.view(*new_context_layer_shape)

        # Output. [b, s, h]
        # Final dense + dropout. The dense layer is the RowParallelLinear
        # defined above; with a single model shard this is
        # [b, s, h] * [h, h] -> [b, s, h].
        output = self.dense(context_layer)
        output = self.output_dropout(output)

        return output
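A quick shape check for _transpose_for_scores with toy sizes (hidden_size=256 and 8 heads, so each head gets 32 dimensions; all values are made up for illustration):

import torch

b, s, h, n = 2, 4, 256, 8        # toy batch, seq length, hidden size, heads
hn = h // n                      # per-head hidden size = 32

x = torch.randn(b, s, h)         # [b, s, h] = [2, 4, 256]
# Split the last dimension into heads: [b, s, n, hn] ...
x = x.view(b, s, n, hn)
# ... then move the head dimension forward: [b, n, s, hn].
x = x.permute(0, 2, 1, 3)
print(x.shape)                   # torch.Size([2, 8, 4, 32])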

2.5 GPT2MLP Module

GPT-2's MLP module simply applies a non-linear transformation along the hidden dimension: h -> 4h -> h. A plain-PyTorch sketch of the same computation comes right after this paragraph, followed by the actual module.
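Stripped of the parallel linear layers, the computation is just two nn.Linear layers around a GELU. This toy version (my own stand-in, not the mpu code) computes exactly h -> 4h -> GELU -> h -> dropout:

import torch

class ToyGPT2MLP(torch.nn.Module):
    """What GPT2ParallelMLP computes when the model is not sharded."""

    def __init__(self, hidden_size, dropout_prob=0.1):
        super().__init__()
        self.dense_h_to_4h = torch.nn.Linear(hidden_size, 4 * hidden_size)
        self.dense_4h_to_h = torch.nn.Linear(4 * hidden_size, hidden_size)
        self.dropout = torch.nn.Dropout(dropout_prob)

    def forward(self, hidden_states):            # [b, s, h]
        x = self.dense_h_to_4h(hidden_states)    # [b, s, 4h]
        x = torch.nn.functional.gelu(x)
        x = self.dense_4h_to_h(x)                # [b, s, h]
        return self.dropout(x)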

class GPT2ParallelMLP(torch.nn.Module):
    """MLP for GPT2.

    MLP will take the input with h hidden state, project it to 4*h
    hidden dimension, perform gelu transformation, and project the
    state back into h hidden dimension. At the end, dropout is also
    applied.

    Arguments:
        hidden_size: The hidden size of the self attention.
        output_dropout_prob: dropout probability for the outputs
                             after self attention and final output.
        init_method: initialization method used for the weights. Note
                     that all biases are initialized to zero and
                     layernorm weight are initialized to one.
        output_layer_init_method: output layer initialization. If None,
                                  use `init_method`.
    """

    def __init__(self, hidden_size, output_dropout_prob, init_method,
                 output_layer_init_method=None):
        super(GPT2ParallelMLP, self).__init__()
        # Set output layer initialization if not provided.
        if output_layer_init_method is None:
            output_layer_init_method = init_method
        # Project to 4h.
        # As before, don't be intimidated by the name: this is just y = x*W + b,
        # computing [b, s, h] * [4h, h]^T -> [b, s, 4h].
        # The weight shape is [output, input]; if the model is sharded, the
        # output dimension is the one that gets split.
        self.dense_h_to_4h = ColumnParallelLinear(hidden_size, 4*hidden_size,
                                                  gather_output=False,
                                                  init_method=init_method)
        # Project back to h.
        # y = x*W + b again: [b, s, 4h] * [h, 4h]^T -> [b, s, h].
        # The weight shape is [output, input]; if the model is sharded, the
        # input dimension is the one that gets split.
        self.dense_4h_to_h = RowParallelLinear(
            4*hidden_size,
            hidden_size,
            input_is_parallel=True,
            init_method=output_layer_init_method)
        self.dropout = torch.nn.Dropout(output_dropout_prob)

    def forward(self, hidden_states):
        # [b, s, 4hp]
        intermediate_parallel = self.dense_h_to_4h(hidden_states)
        intermediate_parallel = gelu(intermediate_parallel)
        # [b, s, h]
        output = self.dense_4h_to_h(intermediate_parallel)
        output = self.dropout(output)
        return output

That covers the code for GPT-2's main model structure.

3. GPT-2 Model Pretraining

Next, let's look at the forward pass to see how the loss function is constructed when pretraining GPT-2.

This code lives in pretrain_gpt2.py.

3.1 GPT2 Pretraining - Building the Model

def get_model(args):
    """Build the model."""

    print_rank_0('building GPT2 model ...')
    # This builds the GPT2Model discussed in detail in Section 2.
    model = GPT2Model(num_layers=args.num_layers,
                      vocab_size=args.vocab_size,
                      hidden_size=args.hidden_size,
                      num_attention_heads=args.num_attention_heads,
                      embedding_dropout_prob=args.hidden_dropout,
                      attention_dropout_prob=args.attention_dropout,
                      output_dropout_prob=args.hidden_dropout,
                      max_sequence_length=args.max_position_embeddings,
                      checkpoint_activations=args.checkpoint_activations,
                      checkpoint_num_layers=args.checkpoint_num_layers,
                      parallel_output=True)

    if mpu.get_data_parallel_rank() == 0:
        print(' > number of parameters on model parallel rank {}: {}'.format(
            mpu.get_model_parallel_rank(),
            sum([p.nelement() for p in model.parameters()])), flush=True)

    # To prevent OOM for model sizes that cannot fit in GPU memory in full precision
    # With deepspeed and fp16, fp32 is only used for the weight update; the
    # expensive forward and backward passes run in fp16.
    # half() converts the model's float32 parameters to float16.
    if args.deepspeed and args.fp16:
        model.half()

    # GPU allocation.
    # Explicitly move the model onto the GPU.
    model.cuda(torch.cuda.current_device())

    # Fp16 conversion.
    # fp16 mixed precision saves a lot of memory; it deserves its own
    # write-up, so I won't expand on it here.
    if args.fp16:
        model = FP16_Module(model)

    # Wrap model for distributed training.
    if USE_TORCH_DDP:
        i = torch.cuda.current_device()
        model = DDP(model, device_ids=[i], output_device=i,
                    process_group=mpu.get_data_parallel_group())
    else:
        model = DDP(model)

    return model

3.2 GPT2 Pretraining - forward

This is the forward step used during pretraining.

The flow is simple: run the model's forward to get GPT-2's output, then compute the loss. The input is sentence[:-1] and the true labels are sentence[1:]; that is, for a sequence of length seq_len, tokens 1 through seq_len-1 are the input and tokens 2 through seq_len are the labels. A minimal illustration of this shift-by-one setup follows below, and then the actual forward_step.
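A minimal, self-contained illustration of the shift-by-one language-modeling loss, using plain F.cross_entropy in place of mpu.vocab_parallel_cross_entropy and toy sizes:

import torch
import torch.nn.functional as F

vocab_size, seq_len = 100, 6
sequence = torch.randint(0, vocab_size, (1, seq_len))   # one toy sentence

tokens = sequence[:, :-1]     # input:  positions 1 .. seq_len-1
labels = sequence[:, 1:]      # labels: positions 2 .. seq_len (shifted by one)

# Pretend model output: one logit vector over the vocabulary per input position.
logits = torch.randn(1, seq_len - 1, vocab_size)

# Mask that zeroes out positions we don't want to train on (e.g. end/padding tokens).
loss_mask = torch.ones(1, seq_len - 1)

losses = F.cross_entropy(logits.view(-1, vocab_size), labels.reshape(-1),
                         reduction='none')
loss = torch.sum(losses * loss_mask.view(-1)) / loss_mask.sum()
print(loss)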

def forward_step(data_iterator, model, args, timers):
    """Forward step."""

    # Get the batch.
    timers('batch generator').start()
    tokens, labels, loss_mask, attention_mask, position_ids = get_batch(
        data_iterator, args, timers)
    timers('batch generator').stop()

    # Forward model.
    # output shape = [b, s, vocab_size]
    # The output at each position of the sequence can be read as the
    # prediction of the next token, given the n tokens seen so far.
    output = model(tokens, position_ids, attention_mask)
    # Compute the cross entropy between the output logits (over the vocabulary
    # in the last dimension) and the labels.
    losses = mpu.vocab_parallel_cross_entropy(output.contiguous().float(),
                                              labels)
    # loss_mask masks out the end tokens.
    loss_mask = loss_mask.view(-1)
    loss = torch.sum(losses.view(-1) * loss_mask) / loss_mask.sum()

    return loss

References

完全图解GPT-2:看完这篇就够了(一)
预训练模型专题_GPT2_模型代码学习笔记 (this blogger wrote reading notes for the huggingface GPT-2 code; worth studying alongside this post)

